We're going to analyze the worldwide box office returns of movies based on their genre.
This is the second step in our analysis. The first step was only considering the domestic market and can be found in the Domestic.ipynb notebook.
To recap our findings there:
It's time now to see what the addition of the worldwide market does to our study!
It's difficult to pin down exact totals for the domestic versus worldwide box office, but it's clear with each passing year, more and more of a movie's gross comes from abroad.
Statista puts the 2018 totals at $11.9 billion for the domestic market and $29.2 billion for the international market.
That puts the total box office ratio at 29% domestic and 71% international.
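As a quick check of that split, the arithmetic can be sketched as follows. The variable names are only illustrative; the two totals are the Statista figures quoted above.

```python
# 2018 box office totals quoted above, in billions of US dollars.
domestic = 11.9
international = 29.2

total = domestic + international
domestic_share = domestic / total * 100
international_share = international / total * 100

print('Domestic: {:.0f}%, International: {:.0f}%'.format(domestic_share, international_share))
```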
We will use the same approach as our domestic analysis:
Decide which genres to consider in our analysis.
Decide on a profitability measure.
Whittle our dataset down to movies released by the Big Five. This is to control for factors like lack of budget or marketing affecting a movie's success. Sure, independent movies can succeed, but our bosses want to see how the genres performed with all the help of a studio behind them.
Analyze the historical performance of the genres by decade. There might be some trends over time that would be useful to know.
Analyze the historical performance of the genres by release week. Maybe certain genres perform better at certain times of year.
Incorporate our findings into a strategy for our bosses!
According to The Numbers, the top six genres in terms of box office gross are:
Our bosses like making money.
Sold.
The question: How can we determine a movie's profitability?
The problem: Hollywood isn't exactly known for its transparency.
Exhibit A is Harry Potter And The Order Of The Phoenix. It made 938 million dollars at the box office and yet somehow lost the studio 167 million dollars.
Exhibit B is My Big Fat Greek Wedding. The movie had a budget of 6 million dollars. It earned over 600 million dollars from worldwide box office receipts, home video sales, rentals, and television broadcast rights. The movie still somehow lost 20 million dollars.
Exhibit C is Return Of The Jedi. It made 572 million on a 32 million budget, and still isn't profitable.
I could go on and on, but let's just leave it at that.
Even if we put Hollywood's accounting shenanigans aside for a moment, it's very difficult to know how much money the studio even keeps from the reported worldwide box office figure.
From our analysis in the Domestic.ipynb notebook, we know that studios generally keep about 50% of domestic box office ticket sales. (Although even this is an estimate, and the amount varies by movie theater chain, distribution company, how long the movie has been out, etc.)
But what about worldwide? There's no simple way to sum up all the many differences in each country.
We must go with an estimate for the purposes of our analysis.
A good rule of thumb is that a movie breaks even when it earns twice its production budget worldwide.
We can use this rule of thumb, but we still need an equation we can use to analyze profitability.
If we assume the studio keeps about 50% of all worldwide box office money, then the amount of money a studio earns is Worldwide Box Office / 2.
If we include marketing costs in the movie's expenses, then we can say the total cost of a movie is 1.5 * Production Budget. (This assumes each movie spends an additional 50% of its production budget on marketing. Marketing budgets vary wildly, so this is again a simplification.)
The breakeven point is where total earnings equal total expenses: (Worldwide Box Office / 2) = (1.5 * Production Budget)
Or, to simplify: Worldwide Box Office = 3 * Production Budget
At the point where the worldwide box office has earned three times the original production budget (or two times its production budget when adjusted for marketing costs), we shall say the movie has broken even.
This explains why we have created a column called worldwide_breakeven that performs this calculation to classify our movies.
We derive our profits equation from the breakeven equation, as profits are what remain after subtracting expenses from earnings.
Profit = (Worldwide Box Office / 2) - (1.5 * Production Budget)
If the result is 0, the movie broke even.
If the result is positive, the movie made money.
If the result is negative, the movie lost money.
This explains why we have created a column called profit that performs this calculation to determine the amount of profit/loss each movie made.
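To make the two equations concrete, here is a minimal sketch using plain Python functions. The function names and the example movie (a $100 million budget) are hypothetical and are not part of the notebook's dataset; the formulas are the ones stated above.

```python
def studio_profit(worldwide_gross, production_budget):
    """Profit under the assumptions above: the studio keeps 50% of the
    worldwide gross and spends 1.5x the production budget in total."""
    return worldwide_gross / 2 - 1.5 * production_budget


def broke_even(worldwide_gross, production_budget):
    """True once the worldwide gross reaches three times the production budget."""
    return worldwide_gross >= 3 * production_budget


# Hypothetical movie with a $100 million production budget.
budget = 100_000_000
for gross in (250_000_000, 300_000_000, 400_000_000):
    print(gross, studio_profit(gross, budget), broke_even(gross, budget))
```

Under these assumptions, the $250 million gross is a loss, the $300 million gross is exactly breakeven, and the $400 million gross is a profit.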
In this section of the notebook, we:
We import a few libraries and set some global Jupyter notebook settings.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
# For creating colormaps
import matplotlib.cm as cm
plt.style.use('fivethirtyeight')
from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))
pd.options.display.max_rows = 400
pd.options.display.max_columns = 50
We import the data and create a few columns we will use in our analysis.
Columns we are adding:
release_decade - We calculate the decade a movie was released using its release year.
worldwide_breakeven - This is a boolean column (i.e. either True or False) based on whether the movie broke even according to our Profitability Equation.
profit - This is a numerical column that calculates the amount of profit a movie earned. We take the amount of worldwide box office dollars it earned, divide it by 2, and then subtract away 1.5 times its production budget. Any money left over is profit.
action, adventure, comedy, drama, horror, thriller_suspense - These columns are boolean columns (i.e. either True or False) that convey whether the movie in question is of the corresponding genre. For example, if a movie has the genre Comedy Drama, it will have a True value in both the comedy and drama column. This is very useful for separating our dataset by genre for graphing purposes.
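The following sketch shows how these derived columns behave on a tiny, made-up table. The three rows are hypothetical; the real notebook builds the same columns from cleaned_movie_data.csv.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres_mojo': ['Comedy / Drama', 'Action / Adventure', None],
    'release_year': [1994, 2008, 2015],
})

# Decade: integer-divide by 10, then multiply back up.
toy['release_decade'] = toy['release_year'] // 10 * 10

# One boolean column per genre; na=False treats a missing genre as "not this genre".
toy['comedy'] = toy['genres_mojo'].str.contains('Comedy', na=False)
toy['drama'] = toy['genres_mojo'].str.contains('Drama', na=False)
toy['action'] = toy['genres_mojo'].str.contains('Action', na=False)

print(toy[['title', 'release_decade', 'comedy', 'drama', 'action']])
```

Movie A counts as both a comedy and a drama, while Movie C (no genre recorded) counts as neither.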
# Read in our dataset
data = pd.read_csv('cleaned_movie_data.csv', parse_dates=['release_date'], usecols=['title', 'distributor_mojo', 'worldwide_adj', 'budget_adj', 'genres_mojo', 'release_year', 'release_week', 'release_date'])
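# Only look at movies that made money (a known, positive worldwide gross)
# The comparison needs its own parentheses because & binds more tightly than >
data = data[data['worldwide_adj'].notna() & (data['worldwide_adj'] > 0)]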
# Only look at movies with budget information
data = data[data['budget_adj'].notna()]
# For decade analysis
data['release_decade'] = data['release_year'].apply(lambda x: x // 10 * 10)
# For breakeven analysis
data['worldwide_breakeven'] = data['worldwide_adj'] >= 3 * data['budget_adj']
# For profit analysis
data['profit'] = (data['worldwide_adj'] / 2) - (1.5 * data['budget_adj'])
# Create columns for genres
# A movie can have multiple genres. If so, we will count it for every genre it's classified with.
data['action'] = data['genres_mojo'].str.contains('Action', na=False)
data['adventure'] = data['genres_mojo'].str.contains('Adventure', na=False)
data['comedy'] = data['genres_mojo'].str.contains('Comedy', na=False)
data['drama'] = data['genres_mojo'].str.contains('Drama', na=False)
data['horror'] = data['genres_mojo'].str.contains('Horror', na=False)
data['thriller_suspense'] = data['genres_mojo'].str.contains('Thriller|Suspense', na=False, regex=True)
# Remove rows that don't contain one of our genres
data = data[data['action'] | data['adventure'] | data['comedy'] | data['drama'] | data['horror'] | data['thriller_suspense']]
data.info()
The Big Five studios we will use in our analysis are:
Studios have come and gone a lot historically. They get bought out by competitors, or go out of business. A lot of messy stuff.
To simplify, we will categorize a movie by its current studio owner. For example, Disney recently purchased 20th Century Fox, so we will categorize a 20th Century Fox movie as Disney.
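The idea of mapping a distributor name to its current owner with a regular expression can be sketched like this. The patterns below are shortened, hypothetical stand-ins for the much longer ones used in the notebook, and current_owner is an illustrative helper, not a function from the notebook.

```python
import re

# Shortened, hypothetical patterns; the notebook's real patterns are far longer.
studio_patterns = {
    'Disney': r'Walt Disney|^Fox$|Fox Searchlight|Buena Vista',
    'Sony': r'Sony|Columbia|TriStar',
}


def current_owner(distributor):
    """Return the present-day parent studio, or None if no pattern matches."""
    for owner, pattern in studio_patterns.items():
        if re.search(pattern, distributor):
            return owner
    return None


for name in ['Fox', 'Fox Searchlight', 'TriStar', 'Foxtrot Films', 'Lionsgate']:
    print(name, '->', current_owner(name))
```

The anchored pattern ^Fox$ only matches a distributor named exactly "Fox", which keeps unrelated names that merely start with "Fox" out of the Disney bucket.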
print('Here are all the distributors currently in our dataset:')
data[data['distributor_mojo'].notna()]['distributor_mojo'].value_counts()
# Create a regex string to combine movies into their respective distributor
# https://en.wikipedia.org/wiki/Major_film_studio#Past
nbcuniversal = 'Universal|Focus Features|Focus World|Gramercy|Working Title|Big Idea|DreamWorks$|Illumination|Carnival|Mac Guff|United International'
print('Here are all the distributors that fall under the Universal umbrella:\n')
print(data[data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
viacom = 'Paramount|BET|Comedy Central|MTV|Nickelodeon|Bardel Entertainment|MTV Animation|Nickelodeon Animation Studio|Awesomeness|CMT|Melange|United International Pictures|VH1|Viacom 18 Motion Pictures'
print('Here are all the distributors that fall under the Paramount umbrella:\n')
print(data[data['distributor_mojo'].str.contains(viacom, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(viacom, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
warnermedia = 'Warner Bros.|CNN Films|HBO|DC Films|New Line|Cartoon Network Studios|Wang Film Productions|Adult Swim Films|Castle Rock Entertainment|Cinemax|Flagship|Fullscreen|Hello Sunshine|Spyglass'
print('Here are all the distributors that fall under the Warner Bros. umbrella:\n')
print(data[data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
disney = 'Walt Disney|^Fox$|Fox Atomic|A&E|Disneynature|ESPN|Fox Searchlight|Hulu|National Geographic|VICE|Fox Family|Lucasfilm|Marvel|The Muppets Studio|UTV Motion Pictures|20th Century Fox Animation|Blue Sky Studios|Lucasfilm Animation|Marvel Animation|Pixar Animation Studios|Buena Vista|Disney|Dragonfly Film Productions|Fox Star Studios|Fox Studios Australia|Kudos Film|New Regency|Patagonik Film Group|Shine Group|Tiger Aspect Productions|Zero Day Fox'
print('Here are all the distributors that fall under the Disney umbrella:\n')
print(data[data['distributor_mojo'].str.contains(disney, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(disney, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
sony = 'Sony|Columbia|Affirm|Screen Gems|Stage 6|Ghost Corps|Funimation|Madhouse|Manga Entertainment UK|TriStar|Destination Films|Left Bank Pictures|Triumph Films'
print('Here are all the distributors that fall under the Sony umbrella:\n')
print(data[data['distributor_mojo'].str.contains(sony, na=False, regex=True)]['distributor_mojo'].value_counts())
print('Sum:', data[data['distributor_mojo'].str.contains(sony, na=False, regex=True)]['distributor_mojo'].value_counts().sum())
data['universal'] = data['distributor_mojo'].str.contains(nbcuniversal, na=False, regex=True)
data['paramount'] = data['distributor_mojo'].str.contains(viacom, na=False, regex=True)
data['warner'] = data['distributor_mojo'].str.contains(warnermedia, na=False, regex=True)
data['disney'] = data['distributor_mojo'].str.contains(disney, na=False, regex=True)
data['sony'] = data['distributor_mojo'].str.contains(sony, na=False, regex=True)
# Start with an empty (object) column so the studio names below can be assigned as strings
data['distributor'] = None
data.loc[data['universal'], 'distributor'] = 'Universal'
data.loc[data['paramount'], 'distributor'] = 'Paramount'
data.loc[data['warner'], 'distributor'] = 'Warner'
data.loc[data['disney'], 'distributor'] = 'Disney'
data.loc[data['sony'], 'distributor'] = 'Sony'
# We only want to keep rows that have one of the Big Five
data = data[data['distributor'].notna()]
figure, axis = plt.subplots()
figure.suptitle('Big 5 Share Of The Worldwide Movie Market')
data['distributor'].value_counts().plot(kind='pie')
axis.set_ylabel('');
We have very few movies from before the 1970s. We will remove these entries to simplify our analysis.
data['release_decade'].value_counts()
data = data[data['release_decade'] >= 1970]
data.info()
Our filtered dataset now has 2,433 entries.
The movie studios all have a fair chunk of the dataset. This will hopefully prevent bias stemming from lack of equitable market share.
We have no missing values, so we can do all monetary calculations safely.
In this section of the notebook, we:
# Tailored from matplotlib documentation
# https://matplotlib.org/examples/api/barchart_demo.html
# Function to add counts/percentages to bar plots
def autolabel(axis, num_decimals=0, counts=None, fontsize=20):
    """
    Attach a text label above each bar displaying its height.
    If sent a list of counts, display those instead.
    """
    for i, val in enumerate(axis.patches):
        if counts is not None:
            height = counts[i]
        else:
            height = round(val.get_height(), num_decimals) if num_decimals > 0 else int(round(val.get_height(), 0))
        # We don't want to display zeros on our bar plots
        if (height == 0) or pd.isnull(height):
            continue
        # Put the count below a negative value bar
        if height < 0:
            axis.text(val.get_x() + val.get_width()/2, val.get_height()*1.05, '{}'.format(height), ha='center', va='top', fontsize=fontsize)
        else:
            axis.text(val.get_x() + val.get_width()/2, val.get_height()*1.05, '{}'.format(height), ha='center', va='bottom', fontsize=fontsize)
# Create custom function to generate the color list when graphing
def generate_color_list(colors_needed=1, order_list=['action', 'adventure', 'comedy', 'drama', 'horror', 'thriller_suspense']):
    colors_available = ['color1', 'color2', 'color3']
    c_list = []
    # Matplotlib needs a list of colors if the graph doesn't have multiple columns per index
    if colors_needed == 1:
        c_list = [genres_dict[genre][colors_available[0]] for genre in order_list]
        return c_list
    # Matplotlib needs a list of tuples if the graph has multiple columns per index
    for i in range(colors_needed):
        temp_tuple = tuple([genres_dict[genre][colors_available[i]] for genre in order_list])
        c_list.append(temp_tuple)
    return c_list
In this section of the notebook, we:
# Create lists of useful information for graphing
genres = ['action', 'adventure', 'comedy', 'drama', 'horror', 'thriller_suspense']
colors = ['#008FD5', '#FC4F30', '#E5AE38', '#6D904F', '#8B8B8B', '#810F7C']
colors2 = ['#87C7E5', '#F4BAB0', '#F4DBA8', '#C7E2AE', '#D6D1D1', '#CE8EDB']
colors3 = ['#C5E7F7', '#F4D7D2', '#F9ECD1', '#E3F2D5', '#EAE8E8', '#ECC8F4']
# Create a dictionary holding the colors for each genre
genres_dict = {
'action': {'color1': '#008FD5', 'color2': '#87C7E5', 'color3': '#C5E7F7', 'colormap': 'Blues'},
'adventure': {'color1': '#FC4F30', 'color2': '#F4BAB0', 'color3': '#F4D7D2', 'colormap': 'Oranges'},
'comedy': {'color1': '#E5AE38', 'color2': '#F4DBA8', 'color3': '#F9ECD1', 'colormap': 'Reds'},
'drama': {'color1': '#6D904F', 'color2': '#C7E2AE', 'color3': '#E3F2D5', 'colormap': 'Greens'},
'horror': {'color1': '#8B8B8B', 'color2': '#D6D1D1', 'color3': '#EAE8E8', 'colormap': 'Greys'},
'thriller_suspense': {'color1': '#810F7C', 'color2': '#CE8EDB', 'color3': '#ECC8F4', 'colormap': 'Purples'}
}
# Create a summary statistics dataframe separated by genre to make graphing easier
# The columns are:
# Number of movies
# Average gross
# All-time gross
# Average budget
# All-time budget
# Dollar earned for dollar spent (including marketing -- adjusted budget is 1.5 times original budget)
# Median dollars earned for dollars spent
# Mean dollars earned for dollars spent
# Median profit
# Mean profit
# All-time profit
# Breakeven percentage
# Current decade (2010s) median profit
# Current decade (2010s) mean profit
# Current decade (2010s) all profit
# Current decade (2010s) breakeven percentage
aggregation_stats_per_genre = {
'num_movies': [data[genre].sum() for genre in genres],
'avg_gross': [round(data[data[genre]]['worldwide_adj'].mean() / 1000000, 1) for genre in genres],
'median_gross': [round(data[data[genre]]['worldwide_adj'].median() / 1000000, 1) for genre in genres],
'all_time_gross': [round(data[data[genre]]['worldwide_adj'].sum() / 1000000000, 1) for genre in genres],
'avg_budget': [round(data[data[genre]]['budget_adj'].mean() / 1000000, 1) for genre in genres],
'median_budget': [round(data[data[genre]]['budget_adj'].median() / 1000000, 1) for genre in genres],
'all_time_budget': [round(data[data[genre]]['budget_adj'].sum() / 1000000000, 1) for genre in genres],
'dollars_earned_for_dollars_spent': [round((data[data[genre]]['worldwide_adj'].sum() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].sum() / 1000000), 1) for genre in genres],
'median_dollars_earned_for_dollars_spent': [round((data[data[genre]]['worldwide_adj'].median() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].median() / 1000000), 1) for genre in genres],
'mean_dollars_earned_for_dollars_spent': [round((data[data[genre]]['worldwide_adj'].mean() / 2000000) / (1.5 * data[data[genre]]['budget_adj'].mean() / 1000000), 1) for genre in genres],
'median_profit': [round((data[data[genre]]['profit'].median() / 1000000), 1) for genre in genres],
'mean_profit': [round((data[data[genre]]['profit'].mean() / 1000000), 1) for genre in genres],
'all_time_profit': [round(data[data[genre]]['profit'].sum() / 1000000000, 1) for genre in genres],
'breakeven_percentage': [round(data[data[genre]]['worldwide_breakeven'].sum() / data[data[genre]]['worldwide_breakeven'].count() * 100, 1) for genre in genres],
'current_decade_median_profit': [round((data[(data[genre]) & (data['release_year'] >=2010)]['profit'].median() / 1000000), 1) for genre in genres],
'current_decade_mean_profit': [round((data[(data[genre]) & (data['release_year'] >=2010)]['profit'].mean() / 1000000), 1) for genre in genres],
'current_decade_profit': [round(data[(data[genre]) & (data['release_year'] >=2010)]['profit'].sum() / 1000000000, 1) for genre in genres],
'current_decade_breakeven_percentage': [round(data[(data[genre]) & (data['release_year'] >=2010)]['worldwide_breakeven'].mean() * 100, 1) for genre in genres]
}
# Create a summary dataframe for simple graphs
summary = pd.DataFrame(aggregation_stats_per_genre, index=genres)
summary
In this section of the notebook, we want to get a broad overview of the entire dataset.
# Create custom function to make bar graphs with our summary dataframe
def plot_summary_dataframe(summary, sort_column, plot_columns, title, colors_needed=1, legend_needed=False, legend_text=[], y_label='Millions', num_decimals=0):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20, y=1.02)
    summary.sort_values(sort_column, ascending=False, inplace=True)
    color_list = generate_color_list(colors_needed=colors_needed, order_list=summary.index)
    summary.plot(y=plot_columns, kind='bar', ax=axis, color=color_list, legend=legend_needed)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xlabel('')
    axis.tick_params(labelsize=20)
    if legend_needed:
        axis.legend(legend_text, fontsize=20)
    autolabel(axis, num_decimals=num_decimals)
    plt.tight_layout()
plot_summary_dataframe(summary=summary, sort_column='num_movies', plot_columns='num_movies',
title='Number of Movies Per Genre', colors_needed=1, legend_needed=False, legend_text=[], y_label='', num_decimals=0)
Let's start with some exploratory data analysis looking at the big picture.
Here, we look at overall trends of how much money the movies in our dataset have earned at the worldwide box office.
# Create custom function to plot different aggregate statistics as histograms
def plot_aggregate_histogram(data, stat, title, bins=10, color=genres_dict['action']['color2']):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,9))
    figure.suptitle(title, fontsize=20)
    (data[stat] / 1000000).plot.hist(bins=bins, ax=axis, fontsize=20, color=color)
    axis.set_xlabel('Millions of Dollars', fontsize=20)
    axis.set_ylabel('Number of Movies', fontsize=20)
    axis.axvline(data[stat].median() / 1000000, color='k', linewidth=1)
    axis.axvline(data[stat].mean() / 1000000, color='r', linewidth=1)
    axis.legend(['Median: {:.1f} million'.format(data[stat].median() / 1000000), 'Mean: {:.1f} million'.format(data[stat].mean() / 1000000)], fontsize=20)
plot_aggregate_histogram(data=data, stat='worldwide_adj', title='Worldwide Grosses',
bins=range(0, 2000, 25), color=genres_dict['action']['color2'])
Here, we look at overall trends for production budgets for the movies in our dataset.
plot_aggregate_histogram(data=data, stat='budget_adj', title='Worldwide Budgets',
bins=range(0, 400, 10), color=genres_dict['action']['color2'])
Here, we look at overall trends for how much profit (as defined by our profitability equation) the movies in our dataset have earned.
plot_aggregate_histogram(data=data, stat='profit', title='Worldwide Profits', bins=range(-300, 1200, 25), color=genres_dict['action']['color2'])
Number of movies
Skewed grosses
Skewed budgets
Skewed profits
Use median
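A small illustration of why the median is the safer summary for skewed data like this: the numbers below are hypothetical profits in millions, chosen so that one blockbuster sits far out in the right tail.

```python
from statistics import mean, median

# Hypothetical profits in millions of dollars: five ordinary movies and one blockbuster.
profits = [-20, -5, 0, 10, 15, 900]

print('Mean:   {:.1f} million'.format(mean(profits)))
print('Median: {:.1f} million'.format(median(profits)))
```

The single outlier drags the mean far above what a typical movie in the list earned, while the median stays close to the bulk of the values.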
In this section of the notebook, we get an overall sense of how the genres compare to each other historically.
This graph shows the results of adding up the worldwide box office grosses for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_gross', plot_columns='all_time_gross',
title='Total Worldwide Gross Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of adding up the worldwide budgets for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_budget', plot_columns='all_time_budget',
title='Total Budgets Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of adding up the worldwide profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='all_time_profit', plot_columns='all_time_profit',
title='Total Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Billions', num_decimals=0)
This graph shows the results of taking the median value of worldwide profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='median_profit', plot_columns='median_profit',
title='Median Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Millions', num_decimals=0)
This graph shows the results of taking the mean value of worldwide profits for all movies, separated by genre.
plot_summary_dataframe(summary=summary, sort_column='mean_profit', plot_columns='mean_profit',
title='Mean Profit Per Genre', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Millions', num_decimals=0)
Highest gross
Profitability
Median profit
Mean profit
Thoughts
In this section of the notebook, we take a closer look at the worldwide box office grosses by genre.
plot_summary_dataframe(summary=summary, sort_column='avg_gross', plot_columns=['avg_gross', 'median_gross'],
title='Mean and Median Worldwide Gross', colors_needed=2, legend_needed=True,
legend_text=['Mean', 'Median'], y_label='Millions', num_decimals=0)
This graph shows a histogram of the worldwide box office grosses of all movies in our dataset, separated by genre.
The width of the bars is 50 million dollars. This means that each bar represents the number of movies that have grossed an amount of money somewhere in that 50 million dollar range.
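As a rough sketch of what those bins look like, the snippet below rebuilds the bin edges from the same range(0, 1800, 50) that is passed to the histogram call in this section. The example gross of $120 million is hypothetical. One detail worth keeping in mind is that range excludes its stop value, so the last edge it produces is 1750 rather than 1800.

```python
# Bin edges in millions of dollars, matching bins=range(0, 1800, 50).
edges = list(range(0, 1800, 50))

# Every pair of neighbouring edges is one bar in the histogram.
widths = {right - left for left, right in zip(edges, edges[1:])}
num_bins = len(edges) - 1

# A hypothetical movie grossing $120 million lands in the bar covering 100 to 150.
gross_millions = 120
bin_start = (gross_millions // 50) * 50

print(num_bins, widths, edges[-1], bin_start)
```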
Note that we are only showing those movies with worldwide grosses up to 1.8 billion dollars here. This is to make the graphs easier to read by not having too much empty space spanning the larger gross values that have very few entries.
This only excludes three movies in our dataset:
# Custom function to plot histograms of a stat by genre
def plot_histograms_by_genre(data, stat, title, genres, bins=10, colors_needed=1):
    figure, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True, figsize=(24,15))
    figure.suptitle(title, fontsize=20)
    sorted_genres = sorted([{'genre': genre, 'amount': (data[data[genre]][stat].median() / 1000000)} for genre in genres], key=lambda k: k['amount'], reverse=True)
    genres_list = [item['genre'] for item in sorted_genres]
    color_list = generate_color_list(colors_needed=1, order_list=genres_list)
    for genre, axis, color in zip(genres_list, axes.flat, color_list):
        (data[data[genre]][stat] / 1000000).plot.hist(bins=bins, ax=axis, color=color)
        axis.set_title(genre, fontsize=20)
        axis.axvline(data[data[genre]][stat].median() / 1000000, color='k', linewidth=1)
        axis.axvline(data[data[genre]][stat].mean() / 1000000, color='r', linewidth=1)
        axis.legend(['Median: {:.1f} million'.format(data[data[genre]][stat].median() / 1000000), 'Mean: {:.1f} million'.format(data[data[genre]][stat].mean() / 1000000)], fontsize=15)
        axis.set_xlabel('Millions', fontsize=20)
        axis.set_ylabel('Number of Movies')
plot_histograms_by_genre(data=data, stat='worldwide_adj', title='Worldwide Gross Distributions',
genres=genres, bins=range(0, 1800, 50), colors_needed=1)
Skew
Mean and median
Action/Adventure
In this section, we examine the Action/Adventure subgenre to see how it affects the Gross, Budget, and Profit of the Action and Adventure genres.
print('Median gross of Action/Adventure movies: ${:.1f} million'.format(data[data['genres_mojo'] == 'Action / Adventure']['worldwide_adj'].median() / 1000000))
print('Median gross of Action (without Adventure component) movies: ${:.1f} million'.format(data[(data['genres_mojo'].str.contains('Action', na=False)) & (~data['genres_mojo'].isin(['Action / Adventure']))]['worldwide_adj'].median() / 1000000))
print('Median gross of Adventure (without Action component) movies: ${:.1f} million'.format(data[(data['genres_mojo'].str.contains('Adventure', na=False)) & (~data['genres_mojo'].isin(['Action / Adventure']))]['worldwide_adj'].median() / 1000000))
def action_adventure_stats(genre, stat, title):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20)
    median_with_action_adventure = data[data[genre]][stat].median() / 1000000
    median_without_action_adventure = data[(data[genre]) & (data['genres_mojo'] != 'Action / Adventure')][stat].median() / 1000000
    grp = data[data[genre]].groupby('genres_mojo')[stat].median().sort_values(ascending=False) / 1000000
    grp.plot(kind='bar', ax=axis, color=genres_dict[genre]['color1'])
    axis.axhline(median_with_action_adventure, color='k', linewidth=1)
    axis.axhline(median_without_action_adventure, color='r', linewidth=1)
    axis.tick_params(labelsize=20)
    axis.set_xlabel('')
    axis.set_ylabel('Millions', fontsize=20)
    axis.legend(['Overall Median With Action/Adventure: {:.1f}'.format(median_with_action_adventure),
                 'Overall Median Without Action/Adventure: {:.1f}'.format(median_without_action_adventure)], loc='best', fontsize=20)
    autolabel(axis)
action_adventure_stats(genre='action', stat='worldwide_adj', title='Median Gross By Action Subgenres')
action_adventure_stats(genre='action', stat='budget_adj', title='Median Budget By Action Subgenres')
action_adventure_stats(genre='action', stat='profit', title='Median Profit By Action Subgenres')
action_adventure_stats(genre='adventure', stat='worldwide_adj', title='Median Gross By Adventure Subgenres')
action_adventure_stats(genre='adventure', stat='budget_adj', title='Median Budget By Adventure Subgenres')
action_adventure_stats(genre='adventure', stat='profit', title='Median Profit By Adventure Subgenres')
Action/Adventure is the culprit!
Gross
Budget
Profit
Keep it in the back of our minds
In this section of the notebook, we take a closer look at worldwide production budgets of each genre since the 1970s.
plot_summary_dataframe(summary=summary, sort_column='avg_budget', plot_columns=['avg_budget', 'median_budget'],
title='Mean and Median Production Budget', colors_needed=2, legend_needed=True,
legend_text=['Mean', 'Median'], num_decimals=0)
plot_histograms_by_genre(data=data, stat='budget_adj', title='Production Budget Distributions',
genres=genres, bins=10, colors_needed=1)
plot_aggregate_histogram(data=data, stat='budget_adj', title='Production Budgets Of All Movies',
bins=10, color=genres_dict['action']['color2'])
Median production budget
Low budgets
In this section of the notebook, we take a closer look at worldwide profits for each genre since the 1970s.
Since we define a movie's genre by all the genres it contains, many of our movies have multiple genres that we care about.
It would make life easier if, in one column, we could store a statistic and the genre of the movie.
Obviously, this involves duplication in situations where a movie has multiple genres (for example, an Action/Adventure movie counts as both an Action and Adventure movie).
These custom functions create new columns that contain the corresponding statistic (worldwide gross, budget, profit, breakeven) and whether the movie is of a certain genre.
This makes graphing certain things much easier.
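The idea of a "statistic for this genre only" column can be sketched on a toy table as below. This uses the vectorized Series.where as one possible way to build such columns; it is an illustrative alternative under that assumption, not the notebook's own row-by-row functions, and the three rows are made up.

```python
import pandas as pd

# Hypothetical rows: two genre flags and a profit figure in millions.
toy = pd.DataFrame({
    'action': [True, False, True],
    'comedy': [False, True, True],
    'profit': [50.0, -10.0, 0.0],
})

for genre in ['action', 'comedy']:
    # Keep the profit where the genre flag is True; everything else becomes NaN.
    toy['worldwide_profit_{}'.format(genre)] = toy['profit'].where(toy[genre])

print(toy)
print(toy[['worldwide_profit_action', 'worldwide_profit_comedy']].median())
```

Because pandas skips NaN when aggregating, each new column can be summarized or plotted directly, and a movie with two genres contributes to both columns.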
# We want individual columns that hold a specific worldwide stat for each genre.
# Since a movie can have multiple genres, right now we must isolate each genre with a groupby while looping over each genre.
# If we create individual columns that contain information about a genre and a worldwide stat, it's easier to graph later.
def worldwide_stat_by_genre(row, genre, stat):
    # If the row is not in the genre, there is no value to record.
    if not row[genre]:
        return np.nan
    # Otherwise return the stat itself (this also keeps a legitimate value of exactly 0).
    return row[stat]
# We want individual columns that store breakeven information for each genre.
# Since we will be adding the entries in these columns (and using pd.DataFrame.mean()), we need to convert them to 1's and 0's.
# Thus, we need to create a separate function from the 'worldwide_stat_by_genre' function.
def test_for_breakeven_by_genre(row, genre, breakeven_column):
    if row[genre]:
        if row[breakeven_column]:
            return 1
        else:
            return 0
    else:
        return np.nan
# List of new columns to hold worldwide stats by genre.
budget_columns = ['worldwide_budget_{}'.format(genre) for genre in genres]
gross_columns = ['worldwide_gross_{}'.format(genre) for genre in genres]
profit_columns = ['worldwide_profit_{}'.format(genre) for genre in genres]
breakeven_columns = ['worldwide_breakeven_{}'.format(genre) for genre in genres]
for genre, col in zip(genres, budget_columns):
    data[col] = data.apply(lambda x: worldwide_stat_by_genre(x, genre, 'budget_adj'), axis=1)
for genre, col in zip(genres, gross_columns):
    data[col] = data.apply(lambda x: worldwide_stat_by_genre(x, genre, 'worldwide_adj'), axis=1)
for genre, col in zip(genres, profit_columns):
    data[col] = data.apply(lambda x: worldwide_stat_by_genre(x, genre, 'profit'), axis=1)
for genre, col in zip(genres, breakeven_columns):
    data[col] = data.apply(lambda x: test_for_breakeven_by_genre(x, genre, 'worldwide_breakeven'), axis=1)
def plot_boxplot(data, genres, title, columns, starting_year=1970, y_label=''):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,15))
    figure.suptitle(title, fontsize=20, y=1.05)
    data[data['release_year'] >= starting_year][columns].plot(kind='box', ax=axis)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xticklabels(genres)
    axis.tick_params(labelsize=20)
    axis.axhline(0, color='k', linewidth=1)
    plt.tight_layout()
plot_boxplot(data=data, genres=genres, title='Profit By Genre', columns=profit_columns, starting_year=1970, y_label='Billions')
Long right tails and negative medians!
def profit_by_subgenres(data, genres, aggregation_function='median', apply_function=lambda x: x / 1000000):
    sorted_genres = sorted([{'genre': genre, 'amount': (data[data[genre]]['profit'].agg(aggregation_function))} for genre in genres], key=lambda k: k['amount'], reverse=True)
    genres_list = [item['genre'] for item in sorted_genres]
    color_list = generate_color_list(colors_needed=1, order_list=genres_list)
    figure, axes = plt.subplots(nrows=6, ncols=1, figsize=(24, 54))
    for genre, color, axis in zip(genres_list, color_list, axes.flat):
        overall_stat = data[data[genre]]['profit'].agg(aggregation_function) / 1000000
        (data[data[genre]].groupby('genres_mojo')['profit'].agg(aggregation_function).apply(apply_function).sort_values(ascending=False)).plot(kind='bar', ax=axis, color=color)
        axis.axhline(overall_stat, color='k', linewidth=1)
        axis.tick_params(labelsize=20)
        axis.set_xlabel('')
        axis.set_ylabel('Millions', fontsize=20)
        axis.set_title('{} Profit By {} Subgenres'.format(aggregation_function.title(), genre.title()), fontsize=20, y=1.02)
        axis.legend(['Overall {}: {:.1f}'.format(aggregation_function.title(), overall_stat)], loc=3, fontsize=20)
        autolabel(axis)
    plt.tight_layout()
profit_by_subgenres(data=data, genres=genres, aggregation_function='median', apply_function=lambda x: x / 1000000)
Movies are a tough business
profit_by_subgenres(data=data, genres=genres, aggregation_function='mean', apply_function=lambda x: x / 1000000)
The skew strongly affects the results
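To see why the mean and median charts can tell different stories, here is a minimal sketch with made-up profit figures (the numbers are hypothetical and not drawn from our dataset). A single large hit is enough to pull the mean above zero while the median stays negative.

```python
from statistics import mean, median

# Hypothetical profits in millions of dollars: most titles lose a little,
# one blockbuster earns a lot (a long right tail).
profits = [-30, -20, -15, -10, -5, 5, 400]

mean_profit = mean(profits)      # pulled upward by the single outlier
median_profit = median(profits)  # the middle title still loses money

print('Mean profit: {:.1f}'.format(mean_profit))
print('Median profit: {:.1f}'.format(median_profit))
```

This is the same pattern the boxplot hints at: long right tails sitting on top of negative medians.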
Here's another way to measure how successful a genre is -- you look at the ratio of earnings to expenses.
In other words, for each movie we capture (Worldwide Box Office / 2) / (1.5 * Production Budget).
Then for each genre, we can either add up all the results (i.e. see how the genre fares for every datapoint we have), take the median, or take the mean.
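The ratio described above can be sketched on a toy table. Everything here is an assumption for illustration: the column names, the dollar figures, and the reading of "add up all the results" as total earnings divided by total expenses for the genre.

```python
import pandas as pd

# Hypothetical figures in millions of dollars; not taken from the dataset.
movies = pd.DataFrame({
    'genre': ['horror', 'horror', 'action', 'action'],
    'worldwide_gross': [90.0, 30.0, 600.0, 150.0],
    'budget': [10.0, 20.0, 200.0, 100.0],
})

# Numerator and denominator of the formula in the text
movies['earned'] = movies['worldwide_gross'] / 2
movies['spent'] = 1.5 * movies['budget']
movies['ratio'] = movies['earned'] / movies['spent']

by_genre = movies.groupby('genre')
ratio_summary = pd.DataFrame({
    # Assumed meaning of the all-time figure: total earned over total spent
    'all_time': by_genre['earned'].sum() / by_genre['spent'].sum(),
    'mean': by_genre['ratio'].mean(),
    'median': by_genre['ratio'].median(),
})
print(ratio_summary)
```

A value above 1 means the genre earned more than it spent under this measure. The all-time figure weights big-budget titles more heavily than the mean of per-movie ratios does, which is one reason the three bars can disagree.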
plot_summary_dataframe(summary=summary, sort_column='dollars_earned_for_dollars_spent',
plot_columns=['dollars_earned_for_dollars_spent', 'mean_dollars_earned_for_dollars_spent', 'median_dollars_earned_for_dollars_spent'],
title='Dollars Earned Per Dollar Spent', colors_needed=3, legend_needed=True,
legend_text=['All-Time', 'Mean', 'Median'], y_label='Dollars', num_decimals=1)
Mean versus Median is very important
all-time and mean), both Horror and Adventure are profitable.
We can calculate the percentage chance a movie has to break even as another way to judge relative risk.
plot_summary_dataframe(summary=summary, sort_column='breakeven_percentage', plot_columns='breakeven_percentage',
title='Breakeven Percentage', colors_needed=1, legend_needed=False,
legend_text=[], y_label='Percent', num_decimals=1)
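As a rough sketch of how a breakeven percentage can be computed, assume each movie carries a boolean flag (named worldwide_breakeven here, with hypothetical rows). The mean of a boolean column is the share of True values, so multiplying by 100 gives a percentage.

```python
import pandas as pd

# Hypothetical rows; the flag is assumed to be True when a movie broke even.
movies = pd.DataFrame({
    'genre': ['horror', 'horror', 'horror', 'drama', 'drama'],
    'worldwide_breakeven': [True, True, False, False, True],
})

# Share of movies that broke even, per genre, as a percentage
breakeven_percentage = movies.groupby('genre')['worldwide_breakeven'].mean() * 100
print(breakeven_percentage.round(1))
```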
Less than 50%
This would be a difficult decision to make.
Action, Comedy, and Adventure have made the most overall money worldwide. Action and Adventure have far higher median worldwide grosses than Comedy. Comedy must be making up for this with its higher numbers of movies released. However, Adventure and Action are the two most expensive genres to make, whereas Comedy is fourth.
The highest aggregate return for every dollar spent comes from Horror, then Adventure.
The highest median return for every dollar spent comes from Horror, then Action.
The highest median profit per genre is Horror, then Comedy. (Note though that both these numbers are negative, since no genre has a positive median profit all-time.)
The genres with the lowest median budgets are Drama, then Horror, then Comedy.
Genres with the best chance to break even are Horror, then Action, then Adventure.
If our bosses are all about capital preservation and are more risk-averse as opposed to reward-inclined, I would suggest Horror and Comedy.
Horror is one of the cheapest genres to produce and yet it has the highest median profit per genre. It also has the highest chance to break even. Our bosses could make around three Horror movies for the price of a single Action movie, or four Horror movies for the price of one Adventure movie. Horror's median gross of \$94 million is well below Action's \$206 million and Adventure's \$257 million, but it would be a safer play.
Comedy has the second cheapest median budget and the second highest median profit. Historically, it is a solid genre, as it has earned the second-most box office dollars in aggregate (behind only Action). It tends to have a ceiling in terms of box office gross (only one Comedy has ever earned \$1 billion at the box office, Forrest Gump), but it's a reliable bet given its much lower budget compared to big fare like Action and Adventure.
If our bosses really care about releasing those mega blockbusters and risk be damned, then we all know they're talking about Action and Adventure, the kings of the right tails. 16 of the top 20 all-time highest grossing movies are in one of those two genres. Three of the other four are in the Horror genre.
I would suggest sticking to those three genres.
There are a lot of reasons why studios prefer to make certain genres over others. After all, it's tough to make theme park rides based on Drama and Comedy movies. This analysis assumes we only care about how much money a movie makes at the box office.
So far, we have only analyzed these genres in aggregate.
Our bosses want more pinpoint accuracy. Which genres are the hottest right now? Which genres perform the best during which parts of the year?
So we've got more digging to do, and we'll next look for trends by Release Decade and Release Week.
We will now dive into the performance of movies by decade of release.
Up until now, we haven't been looking at our data from a time perspective. We have only been looking at movies by genre.
It's time to look at our data by genre and by decade.
To make graphing easier, we create a custom function to help us do this.
def plot_by_time_and_stat(data, genres, title, groupby_column, stat_columns, aggregate_function,
                          apply_needed=False, apply_function=None, y_label='', y_ticks_needed=False, y_ticks='',
                          legend_needed=True, legend_text=genres, color=colors, axhline_needed=False,
                          axhline_value='', autolabel_needed=False, autolabel_fontsize=20):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24,9))
    figure.suptitle(title, fontsize=20)
    if apply_needed:
        data.groupby(groupby_column)[stat_columns].agg(aggregate_function).apply(apply_function).plot(kind='bar', ax=axis, color=color)
    else:
        data.groupby(groupby_column)[stat_columns].agg(aggregate_function).plot(kind='bar', ax=axis, color=color)
    axis.set_xlabel('')
    axis.set_ylabel(y_label, fontsize=20)
    axis.tick_params(labelsize=20)
    if legend_needed:
        axis.legend(legend_text, loc='best', fontsize=15)
    if y_ticks_needed:
        axis.set_yticks(y_ticks)
    if axhline_needed:
        axis.axhline(axhline_value, color='k', linewidth=1)
    if autolabel_needed:
        autolabel(axis, fontsize=autolabel_fontsize)
plot_by_time_and_stat(data=data, genres=genres, title='Genres Released By Decade',
groupby_column='release_decade', stat_columns=genres, aggregate_function='sum',
apply_needed=False, apply_function=None, y_label='Number of Movies', y_ticks_needed=False, y_ticks='')
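The grouping behind this chart can be sketched without any plotting. The rows below are hypothetical, and the way release_decade is derived from release_year is an assumption (the notebook's own column may be built differently). Because each genre column is boolean, summing it within a decade counts the movies in that genre.

```python
import pandas as pd

# Hypothetical rows with one boolean column per genre
movies = pd.DataFrame({
    'release_year': [1977, 1982, 1989, 2008, 2012],
    'action': [True, False, True, True, True],
    'comedy': [False, True, True, False, False],
})

# Assumed derivation: floor the year to the start of its decade
movies['release_decade'] = movies['release_year'] // 10 * 10

# Summing booleans counts the True values in each decade
counts_by_decade = movies.groupby('release_decade')[['action', 'comedy']].sum()
print(counts_by_decade)
```

Note that a movie tagged with two genres is counted once in each column, so the columns can add up to more than the number of movies.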
1970s to 2000s
2000s to 2010s
Fewer movies made now
We will look at a couple graphs to get a sense of our movies without separating them by genre.
plot_by_time_and_stat(data=data, genres=genres, title='Total Box Office By Decade',
groupby_column='release_decade', stat_columns='worldwide_adj',
aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000,
y_label='Billions', y_ticks_needed=False, y_ticks='', legend_needed=False,
legend_text='', color=genres_dict['action']['color2'])
plot_by_time_and_stat(data=data, genres=genres, title='Total Box Office By Year',
groupby_column='release_year', stat_columns='worldwide_adj',
aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000,
y_label='Billions', y_ticks_needed=False, y_ticks='', legend_needed=False,
legend_text='', color=genres_dict['action']['color2'])
plot_by_time_and_stat(data=data, genres=genres, title='Total Worldwide Gross By Genre And Decade', groupby_column='release_decade', stat_columns=gross_columns, aggregate_function='sum', apply_needed=True, apply_function=lambda x: x / 1000000000, y_label='Billions', y_ticks_needed=False, y_ticks='')
# Create custom function to determine the background color for labeling the genre with the highest stat per decade
def find_genre_for_background_color(groupby_instance, decade):
    column_name_list = groupby_instance.loc[decade].sort_values(ascending=False).index[0].split('_')
    # Check if the split string has length 4, if so it is thriller_suspense and requires extra filtering
    # The reason is our genres are 'action', 'adventure', 'comedy', 'drama', 'horror', and 'thriller_suspense'
    # Our worldwide stat column names have the following form: worldwide_(stat name)_genre
    # Thus five of our six genres will have length 3 when split on '_', but 'thriller_suspense' will have length 4
    if len(column_name_list) == 4:
        return '_'.join(column_name_list[-2:])
    # If the genre is not 'thriller_suspense', we just need the last word in the list
    return column_name_list[-1]
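As an aside, the same genre name can be recovered without the length check by limiting how many times the string is split. This is only a sketch with a hypothetical helper name, and it assumes the stat name itself never contains an underscore.

```python
def genre_from_column(column_name):
    # Split on the first two underscores only, giving
    # ['worldwide', stat name, genre]; the genre keeps any underscores it has
    return column_name.split('_', 2)[2]

single_word = genre_from_column('worldwide_gross_horror')
two_words = genre_from_column('worldwide_profit_thriller_suspense')
print(single_word, two_words)
```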
# Create a stacked bar plot of a stat by genre for each year, highlighting the genre with the highest value in each decade
def plot_stat_by_year_and_highlight_decade_winner(data, genres, title, stat_columns, aggregation_function, apply_function=None, y_label=''):
    figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
    figure.suptitle(title, fontsize=20, y=1.05)
    # Determine background colors for each decade
    grp = data.groupby('release_decade')[stat_columns].agg(aggregation_function)
    bg_1970 = genres_dict[find_genre_for_background_color(grp, 1970)]['color1']
    bg_1980 = genres_dict[find_genre_for_background_color(grp, 1980)]['color1']
    bg_1990 = genres_dict[find_genre_for_background_color(grp, 1990)]['color1']
    bg_2000 = genres_dict[find_genre_for_background_color(grp, 2000)]['color1']
    bg_2010 = genres_dict[find_genre_for_background_color(grp, 2010)]['color1']
    # Set up plot
    grp = data.groupby('release_year')[stat_columns].agg(aggregation_function)
    # Only rescale when a function was supplied; calling .apply(None) would raise an error
    if apply_function is not None:
        grp = grp.apply(apply_function)
    grp.plot(kind='bar', stacked=True, ax=axis)
    axis.set_ylabel(y_label, fontsize=20)
    axis.set_xlabel('')
    axis.tick_params(labelsize=20)
    axis.legend(genres, fontsize=20)
    # Each span covers ten bars, which assumes one bar for every year starting in 1970
    axis.axvspan(0, 10, color=bg_1970, alpha=0.2)
    axis.axvspan(10, 20, color=bg_1980, alpha=0.2)
    axis.axvspan(20, 30, color=bg_1990, alpha=0.2)
    axis.axvspan(30, 40, color=bg_2000, alpha=0.2)
    axis.axvspan(40, 50, color=bg_2010, alpha=0.2)
    axis.axvline(10, color='k', alpha=0.2)
    axis.axvline(20, color='k', alpha=0.2)
    axis.axvline(30, color='k', alpha=0.2)
    axis.axvline(40, color='k', alpha=0.2)
    plt.tight_layout()
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Total Worldwide Gross By Genre and Year\n(Background Color Is Highest Earning Genre Per Decade)',
stat_columns=gross_columns, aggregation_function='sum',
apply_function=lambda x: x / 1000000000, y_label='Billions')
Box office increasing at a slower rate
Note
def plot_mean_and_median_by_time_and_stat(data, genres, groupby_column, stat_columns, stat_name_for_title,
                                          apply_needed=False, apply_function=None, y_label='',
                                          y_ticks_needed=False, y_ticks='', axhline_needed=False, axhline_value=''):
    figure, (axis1, axis2) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
    if apply_needed:
        data.groupby(groupby_column)[stat_columns].agg('mean').apply(apply_function).plot(kind='bar', ax=axis1)
        data.groupby(groupby_column)[stat_columns].agg('median').apply(apply_function).plot(kind='bar', ax=axis2)
    else:
        data.groupby(groupby_column)[stat_columns].agg('mean').plot(kind='bar', ax=axis1)
        data.groupby(groupby_column)[stat_columns].agg('median').plot(kind='bar', ax=axis2)
    axis1.set_ylabel(y_label, fontsize=20)
    if y_ticks_needed:
        axis1.set_yticks(y_ticks)
        axis2.set_yticks(y_ticks)
    axis1.set_xlabel('')
    axis1.tick_params(labelsize=20)
    axis1.legend(genres, fontsize=20)
    axis1.set_title('Mean {} By Genre And Decade'.format(stat_name_for_title), fontsize=20, y=1.02)
    axis2.set_ylabel(y_label, fontsize=20)
    axis2.set_xlabel('')
    axis2.tick_params(labelsize=20)
    axis2.legend(genres, fontsize=20)
    axis2.set_title('Median {} By Genre And Decade'.format(stat_name_for_title), fontsize=20, y=1.02)
    if axhline_needed:
        axis1.axhline(axhline_value, color='k', linewidth=1)
        axis2.axhline(axhline_value, color='k', linewidth=1)
    plt.tight_layout()
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=gross_columns, stat_name_for_title='Worldwide Gross',
apply_needed=True, apply_function=lambda x: x / 1000000,
y_label='Millions', y_ticks_needed=True, y_ticks=range(0, 1100, 100))
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Median Worldwide Gross By Genre and Year\n(Background Color Is Highest Grossing Median Genre Per Decade)',
stat_columns=gross_columns, aggregation_function='median',
apply_function=lambda x: x / 1000000, y_label='Millions')
Contracting period, then expanding period
Median gross change from 2000s to 2010s ranked from highest to lowest
Highest median gross by decade
def one_stat_over_time_in_separate_graphs(data, genres, title, figsize, colors, groupby_column, stat_column,
                                          aggregation_function, starting_year=1970, apply_needed=False,
                                          apply_function=None, xtick_values='', y_label='',
                                          axhline_needed=False, axhline_value=''):
    figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=figsize)
    figure.suptitle(title, fontsize=20, y=1.02)
    for genre, axis, color in zip(genres, axes.flat, colors):
        # Create a series indexed by the groupby column with the aggregated stat as values
        grp = data[(data[genre]) & (data['release_year'] >= starting_year)].groupby(groupby_column)[stat_column].agg(aggregation_function)
        if apply_needed:
            grp = grp.apply(apply_function)
        grp.sort_index(ascending=True).plot(kind='bar', xticks=xtick_values, ax=axis, linewidth=3, color=color)
        axis.set_ylabel(y_label, fontsize=20)
        axis.tick_params(labelsize=20)
        axis.set_xlabel('')
        axis.legend([genre], loc=2, fontsize=15)
        autolabel(axis)
        if axhline_needed:
            axis.axhline(axhline_value, color='k', linewidth=1)
    plt.tight_layout()
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Worldwide Gross By Genre and Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='worldwide_adj',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=False, axhline_value='')
1990s to 2000s
2000s to 2010s
Horror
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=budget_columns, stat_name_for_title='Worldwide Budget',
apply_needed=True, apply_function=lambda x: x / 1000000, y_label='Millions', y_ticks_needed=False, y_ticks='')
plot_stat_by_year_and_highlight_decade_winner(data=data, genres=genres,
title='Median Worldwide Budget By Genre and Year\n(Background Color Is Highest Median Genre Per Decade)',
stat_columns=budget_columns, aggregation_function='median',
apply_function=lambda x: x / 1000000, y_label='Millions')
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Budget By Genre and Release Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='budget_adj',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=False, axhline_value='')
Mean and median pretty much the same
Since the 1990s
Action and Adventure
plot_mean_and_median_by_time_and_stat(data=data, genres=genres, groupby_column='release_decade',
stat_columns=profit_columns, stat_name_for_title='Profits',
apply_needed=True, apply_function=lambda x: x / 1000000,
y_label='Millions', y_ticks_needed=False, y_ticks='', axhline_needed=True, axhline_value=0)
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Median Profit By Genre and Release Decade', figsize=(24,24),
colors=colors, groupby_column='release_decade', stat_column='profit',
aggregation_function='median', apply_needed=True, apply_function=lambda x: x / 1000000,
xtick_values=range(1970, 2020, 10), y_label='Millions', axhline_needed=True, axhline_value=0)
The average movie is not a winner
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Breakeven Percentage By Decade', figsize=(24,16),
colors=colors, groupby_column='release_decade', stat_column='worldwide_breakeven',
aggregation_function='mean', apply_needed=True, apply_function=lambda x: x * 100,
xtick_values=range(1970, 2020, 10), y_label='Percentage', axhline_needed=False, axhline_value='')
plot_summary_dataframe(summary=summary,
sort_column='current_decade_breakeven_percentage',
plot_columns='current_decade_breakeven_percentage',
title='Current Decade Breakeven Percentage By Genre',
colors_needed=1,
legend_needed=False,
legend_text=[],
y_label='Percentage',
num_decimals=1)
1970s to 2000s
2000s to 2010s
Safest current genres
Movies are risky today
The movie business is so variable that looking at trends within subgenres probably doesn't yield much actionable insight.
But we shall look at mean and median profitability of subgenres by decade just in case.
# Function to plot mean and median profitability by subgenre by decade
def subgenre_profitability_by_decade(genre, colors):
    subgenres = data[data[genre.lower()]].groupby('genres_mojo').count().index
    num_subgenres = len(subgenres)
    figure, axes = plt.subplots(nrows=num_subgenres, ncols=1, figsize=(24, 50), sharex=True)
    figure.suptitle('Mean and Median Profit By {} Subgenre And Decade'.format(genre.title()), fontsize=20, y=1.02)
    for subgenre, axis in zip(subgenres, axes.flat):
        # Select 'profit' before aggregating so non-numeric columns are never averaged
        grp = data[data['genres_mojo'] == subgenre].groupby('release_decade')['profit'].agg(['mean', 'median']) / 1000000
        # If the dataframe is missing a decade, add it as an index and set the values to zero
        for decade in range(1970, 2020, 10):
            if decade not in grp.index:
                grp.loc[decade] = 0
        # Sort by the index to have the decades in chronological order
        grp.sort_index(ascending=True, inplace=True)
        grp.plot(kind='bar', xticks=range(1970, 2020, 10), color=colors, linewidth=3, ax=axis)
        axis.set_ylabel('Millions', fontsize=20)
        axis.set_title(subgenre, fontsize=20)
        axis.legend(['Mean', 'Median'], loc='lower left', fontsize=15)
        axis.set_xlabel('')
        axis.tick_params(labelsize=20)
        axis.axhline(0, color='k', linewidth=1)
        autolabel(axis)
    plt.tight_layout()
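The missing-decade fill used inside the function can also be written with reindex, which adds the absent decades and orders the index in one step. This is a small sketch on hypothetical values, not a change to the function itself.

```python
import pandas as pd

# Hypothetical mean and median profits (in millions) for a subgenre
# that only appears in two decades
grp = pd.DataFrame({'mean': [12.0, -3.0], 'median': [5.0, -8.0]}, index=[1980, 2010])

decades = list(range(1970, 2020, 10))
# Decades with no movies get a row of zeros; existing rows are kept
filled = grp.reindex(decades, fill_value=0)
print(filled)
```

Keep in mind that a zero bar then means "no movies in this decade" rather than "zero profit", which matters when reading sparsely populated subgenres.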
subgenre_profitability_by_decade('action', [genres_dict['action']['color1'], genres_dict['action']['color2']])
There's no clear best subgenre. There's too much variance by decade. In the current decade, the only subgenres with positive median profits are Action/Adventure, Action, and Sci-Fi Action.
subgenre_profitability_by_decade('adventure', [genres_dict['adventure']['color1'], genres_dict['adventure']['color2']])
Period Adventure, Action/Adventure, and Adventure are the only profitable subgenres this decade.
subgenre_profitability_by_decade('comedy', [genres_dict['comedy']['color1'], genres_dict['comedy']['color2']])
Median profitability of comedies since the 1990s hasn't been great, with most subgenres in the red for most if not all of the last three decades.
The only subgenre with positive median profitability in the 2010s is Family Comedy.
subgenre_profitability_by_decade('drama', [genres_dict['drama']['color1'], genres_dict['drama']['color2']])
Only Music Drama has a positive median profitability this decade.
subgenre_profitability_by_decade('horror', [genres_dict['horror']['color1'], genres_dict['horror']['color2']])
Only movies labeled as just Horror have a positive median profitability this decade. All other subgenres have negative median profitability.
subgenre_profitability_by_decade('thriller_suspense', [genres_dict['thriller_suspense']['color1'], genres_dict['thriller_suspense']['color2']])
The only subgenre in Thriller/Suspense that has a positive median profitability this decade is Thriller.
No subgenre stands out as a better play than any other. There's simply too much variance over time. There's also the problem of some subgenres appearing infrequently. For example, some decades are missing certain subgenres entirely.
As an example, our mean and median profitability in 'Family Adventure' movies is almost two billion dollars in the 1980s. But that is solely due to our only entry in that category being E.T., one of the most successful movies of all time. We should not conclude that our studio should start pumping out Family Adventure movies because of this information.
The movie business goes through phases like any other industry. Certain genres are hot for a minute, but then cool down for a while. For this reason, scrutinizing subgenres over the past fifty years looking for solid trends might not yield much.
If we restrict our focus to just the last decade, we might find some insights into the current state of genres and subgenres though.
We will now dive into the performance of movies in this current decade (2010 - 2018).
We create a column called budget_bins that categorizes each movie by budget size. The options are '0 - 1m', '1 - 5m', '5 - 10m', '10 - 25m', '25 - 50m', '50 - 100m', '100 - 200m', '200 - 300m', and '300 - 400m'. These represent where each movie's production budget falls (in millions of dollars). Then we create some custom functions to display worldwide profits, breakeven percentage, and the number of movies released for all genres when subdivided by budget size.
one_stat_over_time_in_separate_graphs(data=data, genres=genres, title='Percentage of Movies That Breakeven This Decade', figsize=(24,16),
colors=colors, groupby_column='release_year', stat_column='worldwide_breakeven',
aggregation_function='mean', starting_year=2010, apply_needed=True, apply_function=lambda x: x * 100,
xtick_values=range(2010, 2019, 1), y_label='Percentage', axhline_needed=False, axhline_value='')
plot_summary_dataframe(summary=summary, sort_column='current_decade_breakeven_percentage',
plot_columns='current_decade_breakeven_percentage', title='Current Decade Breakeven Percentage By Genre',
colors_needed=1, legend_needed=False, legend_text=[], y_label='Percentage', num_decimals=1)
Year by year takeaways
plot_summary_dataframe(summary=summary, sort_column='current_decade_profit',
plot_columns=['current_decade_profit', 'current_decade_mean_profit', 'current_decade_median_profit'],
title='Current Decade Aggregate, Mean, and Median Profit By Genre', colors_needed=3,
legend_needed=True, legend_text=['Aggregate Profit (In Billions)', 'Mean Profit', 'Median Profit'],
y_label='Millions', num_decimals=1)
# https://matplotlib.org/3.1.0/tutorials/colors/colormap-manipulation.html
# https://stackoverflow.com/questions/1735025/how-to-normalize-a-numpy-array-to-within-a-certain-range
# https://matplotlib.org/users/gridspec.html#gridspec-and-subplotspec
# Import colormap functionality from matplotlib
import matplotlib.cm as cm
# Import minmax_scale to rescale our counts to the range [0, 1] for the custom colormap
from sklearn.preprocessing import minmax_scale
figure = plt.figure(figsize=(24,12))
figure.suptitle('Median Profit By Subgenre This Decade', fontsize=20)
gs = matplotlib.gridspec.GridSpec(50, 50)
ax1 = plt.subplot(gs[:, :-1])
ax2 = plt.subplot(gs[:, -1:])
grp = data[data['release_year'] >= 2010].groupby('genres_mojo')['profit'].agg(['median', 'count']).sort_values(by='median', ascending=False)
# Use 'viridis' colormap
viridis = cm.get_cmap('viridis')
# Normalize our counts series
scaled_counts = minmax_scale(grp['count'].astype(float), feature_range=(0,1))
# List of colors using rescaled count values
new_cmap = [viridis(item) for item in scaled_counts]
(grp['median'] / 1000000).plot(kind='bar', ax=ax1, color=new_cmap)
norm = matplotlib.colors.Normalize(vmin=grp['count'].min(), vmax=grp['count'].max())
cb1 = matplotlib.colorbar.ColorbarBase(ax2, cmap=viridis, norm=norm, orientation='vertical')
ax2.set_ylabel('Number of Movies', fontsize=20)
ax1.set_xlabel('')
ax1.set_ylabel('Millions', fontsize=20)
ax1.tick_params(labelsize=20)
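For readers unfamiliar with min-max scaling, here is a plain Python sketch of the idea used to color the bars: the smallest count maps to 0, the largest to 1, and everything else falls proportionally in between. The helper name and the counts are hypothetical, and the handling of a constant input is a choice made for this sketch.

```python
def min_max_scale(values):
    # Maps the smallest value to 0.0 and the largest to 1.0
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Avoid dividing by zero when every value is the same
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]

# Hypothetical movie counts for three subgenres
scaled = min_max_scale([2, 6, 10])
print(scaled)
```

A colormap such as viridis accepts values in this range, so a subgenre with few movies gets a color from one end of the map and a heavily populated subgenre gets a color from the other.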
Aggregate Profit
Mean Profit
Median Profit
Subgenres
Profitable Subgenres This Decade
It might help to further subdivide our genres by their budgets to look for patterns there.
bins = [0, 1000000, 5000000, 10000000, 25000000, 50000000, 100000000, 200000000, 300000000, 400000000]
group_names = ['0 - 1m', '1 - 5m', '5 - 10m', '10 - 25m', '25 - 50m', '50 - 100m', '100 - 200m', '200 - 300m', '300 - 400m']
subgenre_colors = ['#8d6a9f', '#006494', '#fcfc62', '#2d4739', '#bb342f', '#6eeb83', '#e56399', '#ffe8d4', '#57886c', '#ff7700', '#16f4d0', '#bfae48', '#90c290', '#330f0a']
data['budget_bins'] = pd.cut(data['budget_adj'], bins, labels=group_names)
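It is worth knowing how pd.cut treats values on the edges of the bins. The sketch below uses a shortened version of the bins and hypothetical budgets. With the default settings each bin includes its right edge but not its left edge, and values outside every bin become missing.

```python
import pandas as pd

# Shortened bins and labels for illustration
example_bins = [0, 1000000, 5000000, 10000000]
example_names = ['0 - 1m', '1 - 5m', '5 - 10m']

# Hypothetical budgets, chosen to land on and around the bin edges
budgets = pd.Series([0, 500000, 1000000, 1000001, 12000000])
binned = pd.cut(budgets, example_bins, labels=example_names)
print(binned.tolist())
```

A budget of exactly 1,000,000 lands in '0 - 1m', a budget of 0 is left unlabeled because the first bin excludes its left edge, and a budget above the last edge is also unlabeled. Any movie with an adjusted budget over 400 million would therefore drop out of the budget tables.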
# https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
# Custom function to add blue if a majority of the films break even
def background_color_blue_if_greater_than_fifty_percent(val):
    if val > 0.5:
        return 'background-color: {}'.format('#87C7E5')
    return ''
# https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
# Custom function to highlight the max value in a series
def highlight_max(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''), index=data.index, columns=data.columns)
# Create custom function to display profit and count information by genre and budget size for the current decade (2010s)
def current_decade_budget_sizes(data, genre):
    styler_object = (data[
            (data['release_year'] >= 2010) &
            (data['genres_mojo'].str.contains(genre))
        ][['budget_bins', 'profit', 'worldwide_breakeven']]
        .apply(lambda x: x / 1000000 if x.name == 'profit' else x)
        .sort_values(by=['budget_bins', 'profit'], ascending=False)
        .groupby('budget_bins')
        .agg(['mean', 'median', 'count', 'sum'])
        .drop([('profit', 'count'), ('profit', 'mean'), ('profit', 'sum'), ('worldwide_breakeven', 'median')], axis=1)
        .dropna()
        .style
        .applymap(background_color_blue_if_greater_than_fifty_percent, subset=[('worldwide_breakeven', 'mean')])
        .apply(highlight_max, subset=[('worldwide_breakeven', 'count')])
        .background_gradient('winter', subset=[('profit', 'median')]))
    return styler_object
action_budget_info = current_decade_budget_sizes(data=data, genre='Action')
action_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
adventure_budget_info = current_decade_budget_sizes(data=data, genre='Adventure')
adventure_budget_info
Most produced budget
Most profitable budget (descending order)
Least profitable budget
Conclusions
action_adventure_budget_info = current_decade_budget_sizes(data=data, genre='Action / Adventure')
action_adventure_budget_info
Since Action/Adventure is one of the only profitable subgenres this decade, it warrants a closer look.
Most produced budget
Most profitable budget (descending order)
Least profitable budget
Conclusions
comedy_budget_info = current_decade_budget_sizes(data=data, genre='Comedy')
comedy_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
drama_budget_info = current_decade_budget_sizes(data=data, genre='Drama')
drama_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
horror_budget_info = current_decade_budget_sizes(data=data, genre='Horror')
horror_budget_info
Most produced budget
Most profitable budgets
Least profitable budget
Conclusions
thriller_suspense_budget_info = current_decade_budget_sizes(data=data, genre='Thriller|Suspense')
thriller_suspense_budget_info
Most produced budget
Most profitable budget
Least profitable budget
Conclusions
plot_boxplot(data=data, genres=genres, title='Profit By Genre, Current Decade', columns=profit_columns, starting_year=2010, y_label='Hundreds of Millions')
The boxplot helps shed some light on each genre's strengths and weaknesses.
Movies are an outlier-driven business
Safe versus Risky
Breaking even
Median Profit
Recommendation
Final step
We will now dive into the performance of movies by their release week in the calendar year. Weeks are numbered 1 through 53, since some years stretch into a 53rd week.
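As a sketch of where a week number like this can come from, the standard library exposes ISO 8601 week numbers. Whether the release_week column in our data follows the ISO convention is an assumption here, and the helper name is hypothetical.

```python
from datetime import date

def release_week(release_date):
    # ISO 8601 week number, from 1 to 53
    return release_date.isocalendar()[1]

midsummer_week = release_week(date(2018, 7, 4))
year_end_week = release_week(date(2015, 12, 31))
print(midsummer_week, year_end_week)
```

Under the ISO convention the first days of January can belong to the last week of the previous year, so a handful of early January releases may show up in week 52 or 53.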
figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
figure.suptitle('Number of Movies Released By Week\n(Seasons Are Color-Coded)', fontsize=20, y=1.05)
grp = data.groupby('release_week')[genres].sum()
grp.plot(kind='bar', stacked=True, ax=axis)
axis.set_ylabel('Count', fontsize=20)
axis.set_xlabel('')
axis.tick_params(labelsize=20)
axis.legend(genres, fontsize=20)
# Subtract one from axvspan ranges to account for it being a bar chart and not a line chart (e.g. Spring is weeks 9-22)
axis.axvspan(8, 21, alpha=0.1, facecolor='pink')
axis.axvspan(21, 34, alpha=0.1, facecolor='yellow')
axis.axvspan(34, 47, alpha=0.1, facecolor='orange')
axis.axvspan(47, 52, alpha=0.1, facecolor='green')
axis.axvspan(0, 8, alpha=0.1, facecolor='green')
plt.tight_layout()
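The shaded bands above imply a mapping from week number to season. A minimal sketch of that mapping follows. The boundaries are read off the spans in the plot, and which season owns a boundary week (week 22, for example) is an assumption made for this sketch.

```python
def season_for_week(week):
    # Boundaries follow the shaded spans in the chart; the assignment of
    # boundary weeks to one season or the other is an assumption.
    if 9 <= week <= 21:
        return 'spring'
    if 22 <= week <= 34:
        return 'summer'
    if 35 <= week <= 47:
        return 'fall'
    # Winter wraps around the new year: weeks 48-53 and weeks 1-8
    return 'winter'

seasons = [season_for_week(week) for week in (1, 10, 30, 40, 53)]
print(seasons)
```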
Other than a few weeks, a healthy number of movies have been released in every week of the year.
Let's subdivide by genre to get a better look.
def num_movies_released_by_release_week_by_genre(data, title, starting_year=1970, genres=genres, colors=colors):
    figure, axes = plt.subplots(nrows=6, ncols=1, figsize=(24, 16), sharex=True, sharey=True)
    figure.suptitle(title, fontsize=20, y=1.02)
    for genre, axis, color in zip(genres, axes.flat, colors):
        grp = data[(data['release_year'] >= starting_year) & (data[genre])].groupby('release_week')['title'].count()
        # If the series is missing a week, add it as an index
        # Then set the value to 0
        for week in range(1, 54):
            if week not in grp.index:
                grp.loc[week] = 0
        grp.sort_index(inplace=True, ascending=True)
        grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
        axis.set_ylabel('Count', fontsize=12)
        axis.set_xlabel('')
        axis.legend([genre], loc=2, fontsize=15)
        # Subtract one from axvspan ranges to account for it being a bar chart and not a line chart
        axis.axvspan(8, 21, alpha=0.1, facecolor='pink')
        axis.axvspan(21, 34, alpha=0.1, facecolor='yellow')
        axis.axvspan(34, 47, alpha=0.1, facecolor='orange')
        axis.axvspan(47, 52, alpha=0.1, facecolor='green')
        axis.axvspan(0, 8, alpha=0.1, facecolor='green')
    plt.tight_layout()
num_movies_released_by_release_week_by_genre(data=data, title='Number of Movies Released By Release Week',
starting_year=1970, genres=genres, colors=colors)
Comedy has been released in good numbers in practically every week.
Drama is released the most in Fall and Winter.
Action and Adventure are released the most in Summer.
Horror and Thriller/Suspense don't really have clear patterns.
num_movies_released_by_release_week_by_genre(data=data, title='Number of Movies Released By Release Week, This Decade',
starting_year=2010, genres=genres, colors=colors)
Comedy is released consistently in more weeks than any other genre.
Drama is still weighted towards Fall and Winter weeks.
Action is concentrated on Summer releases.
Adventure, Horror, and Thriller/Suspense have less clear patterns.
# Custom function to graph a fill_between line graph of a stat's performance by release week in two ways: before 2010 and the current decade
def fill_between_by_release_week(data, title, stat, genres=genres, colors=colors, y_label='Millions'):
    figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
    figure.suptitle(title, fontsize=20, y=1.05)
    for genre, axis, color in zip(genres, axes.flat, colors):
        grp1 = data[(data['release_year'] < 2010) & (data[genre])].groupby('release_week')[stat].median() / 1000000
        grp2 = data[(data['release_year'] >= 2010) & (data[genre])].groupby('release_week')[stat].median() / 1000000
        # If either series is missing a week, add it with a value of 0
        for week in range(1, 54, 1):
            if week not in grp1.index:
                grp1.loc[week] = 0
            if week not in grp2.index:
                grp2.loc[week] = 0
        # Sort the series by its index to have the weeks in chronological order
        grp1.sort_index(ascending=True, inplace=True)
        grp2.sort_index(ascending=True, inplace=True)
        axis.plot(range(1,54), grp1, color=colors[0], label='1970-2009')
        axis.plot(range(1,54), grp2, color=colors[1], label='This Decade')
        axis.fill_between(range(1, 54), y1=grp1, y2=grp2, where=grp2 <= grp1, facecolor=colors[0], interpolate=True, edgecolor='k')
        axis.fill_between(range(1, 54), y1=grp1, y2=grp2, where=grp2 > grp1, facecolor=colors[1], interpolate=True, edgecolor='k')
        axis.set_title(genre, fontsize=20)
        axis.set_ylabel(y_label, fontsize=12)
        axis.set_xlabel('')
        axis.legend(loc=2, fontsize=15)
        axis.axvspan(9, 22, alpha=0.2, color='pink')
        axis.axvspan(22, 35, alpha=0.2, color='yellow')
        axis.axvspan(35, 48, alpha=0.2, color='orange')
        axis.axvspan(48, 53, alpha=0.2, color='green')
        axis.axvspan(1, 9, alpha=0.2, color='green')
    plt.tight_layout()
fill_between_by_release_week(data=data, title='Median Gross By Release Week\n(Seasons Are Color-Coded)',
stat='worldwide_adj', genres=genres, colors=colors, y_label='Millions')
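As an aside, the week-by-week loop that fills in missing release weeks could probably be replaced with a single `reindex` call. Here is a minimal sketch on made-up data. The `toy` frame and its values are invented for illustration. Only the column names mirror our dataset.

```python
import pandas as pd

# Hypothetical toy data: invented values, column names borrowed from the notebook
toy = pd.DataFrame({
    'release_week': [1, 1, 3, 53],
    'worldwide_adj': [100e6, 300e6, 50e6, 80e6],
})

all_weeks = range(1, 54)  # release weeks 1 through 53

# Median gross per release week in millions.
# reindex adds every missing week with a value of 0 and returns the weeks in order.
median_by_week = (
    toy.groupby('release_week')['worldwide_adj'].median() / 1_000_000
).reindex(all_weeks, fill_value=0)

print(median_by_week.head())
```

This gives the same filled and sorted series as the loop plus `sort_index`, without changing the series in place.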
Action
Adventure
Comedy
Drama
Horror
Thriller/Suspense
fill_between_by_release_week(data=data, title='Median Budget By Release Week\n(Seasons Are Color-Coded)',
stat='budget_adj', genres=genres, colors=colors, y_label='Millions')
fill_between_by_release_week(data=data, title='Median Profit By Release Week\n(Seasons Are Color-Coded)',
stat='profit', genres=genres, colors=colors, y_label='Millions')
The results are similar to what we saw in the Median Gross analysis.
release_weeks_with_no_movies_all_time = [0] * 6
counter = [0, 1, 2, 3, 4, 5]
figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
figure.suptitle('Percentage of Movies That Break Even By Release Week', fontsize=20, y=1.02)
for genre, axis, color, count in zip(genres, axes.flat, colors, counter):
    # Percentage of the genre's movies that broke even, by release week
    grp = data[data[genre]].groupby('release_week')['worldwide_breakeven'].mean() * 100
    # If a release week has no movies, add it to the series with a value of 0,
    # mark it with a white line, and count it as an unused week for this genre
    for week in range(1, 54):
        if week not in grp.index:
            grp.loc[week] = 0
            # Bars are drawn at zero-based positions, so week N sits at position N - 1
            axis.axvline(week - 1, color='white', linewidth=2)
            release_weeks_with_no_movies_all_time[count] += 1
    # Sort the series by its index to put the release weeks in order
    grp.sort_index(ascending=True, inplace=True)
    grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
    axis.set_ylabel('Percentage', fontsize=12)
    axis.legend([genre], loc=2, fontsize=15)
    # Show 50% breakeven line
    axis.axhline(50, color='k', linewidth=1)
    # Color-code the seasons by release week
    axis.axvspan(9, 22, alpha=0.2, color='pink')
    axis.axvspan(22, 35, alpha=0.2, color='yellow')
    axis.axvspan(35, 48, alpha=0.2, color='orange')
    axis.axvspan(48, 53, alpha=0.2, color='green')
    axis.axvspan(1, 9, alpha=0.2, color='green')
plt.tight_layout()
Historically, it looks like the best chance to break even is with a Horror movie released in December.
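A quick note on how these percentages come about: the mean of a True/False column is the share of True values, so multiplying by 100 gives the percentage of movies that broke even. The sketch below uses invented rows (only the column names match our dataset) and also counts the movies behind each percentage, since a week with one or two releases can swing to 0% or 100% very easily.

```python
import pandas as pd

# Hypothetical toy data: invented rows, column names borrowed from the notebook
toy = pd.DataFrame({
    'release_week': [50, 50, 50, 51],
    'worldwide_breakeven': [True, True, False, False],
})

# mean of a boolean column = fraction of True values; count = movies behind that fraction
by_week = toy.groupby('release_week')['worldwide_breakeven'].agg(['mean', 'count'])
by_week['breakeven_pct'] = by_week['mean'] * 100

print(by_week[['breakeven_pct', 'count']])
```

Keeping the count next to the percentage makes it easier to tell a genuinely strong week from one that only looks strong because very few movies were released in it.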
release_weeks_with_no_movies_this_decade = [0] * 6
counter = [0, 1, 2, 3, 4, 5]
figure, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(24, 16))
figure.suptitle('Percentage of Movies That Break Even By Release Week, This Decade', fontsize=20, y=1.02)
for genre, axis, color, count in zip(genres, axes.flat, colors, counter):
    #grp_count = data[data[genre]].groupby('release_week')['worldwide_breakeven'].count().copy()
    #grp_sum = data[data[genre]].groupby('release_week')['worldwide_breakeven'].sum().copy()
    # Percentage of the genre's movies released since 2010 that broke even, by release week
    grp = data[(data['release_year'] >= 2010) & (data[genre])].groupby('release_week')['worldwide_breakeven'].mean() * 100
    # If a release week has no movies, add it to the series with a value of 0,
    # mark it with a white line, and count it as an unused week for this genre
    for week in range(1, 54):
        if week not in grp.index:
            grp.loc[week] = 0
            # Bars are drawn at zero-based positions, so week N sits at position N - 1
            axis.axvline(week - 1, color='white', linewidth=3)
            release_weeks_with_no_movies_this_decade[count] += 1
    # Sort the series by its index to put the release weeks in order
    grp.sort_index(ascending=True, inplace=True)
    grp.plot(kind='bar', xticks=range(1, 54), ax=axis, linewidth=3, color=color)
    axis.set_ylabel('Percentage', fontsize=12)
    axis.legend([genre], loc=2, fontsize=15)
    # Show 50% breakeven line
    axis.axhline(50, color='k', linewidth=1)
    # Color-code the seasons by release week
    axis.axvspan(9, 22, alpha=0.2, color='pink')
    axis.axvspan(22, 35, alpha=0.2, color='yellow')
    axis.axvspan(35, 48, alpha=0.2, color='orange')
    axis.axvspan(48, 53, alpha=0.2, color='green')
    axis.axvspan(1, 9, alpha=0.2, color='green')
plt.tight_layout()
release_week = pd.DataFrame({'all_time': release_weeks_with_no_movies_all_time, 'this_decade': release_weeks_with_no_movies_this_decade}, index=genres)
release_week.sort_values(by='all_time', ascending=False, inplace=True)
color_list = generate_color_list(colors_needed=2, order_list=release_week.index)
figure, axis = plt.subplots(nrows=1, ncols=1, figsize=(24, 9))
figure.suptitle('Number of Release Weeks Where No Movies Have Been Released, By Genre', fontsize=20, y=1.05)
release_week.plot(kind='bar', ax=axis, color=color_list)
axis.set_ylabel('Number of Release Weeks', fontsize=20)
axis.tick_params(labelsize=20)
axis.legend(fontsize=20)
plt.tight_layout()
It seems fewer release weeks are being utilized for some genres.
This might be coincidence, or it might be that studios don't think certain weeks work for some genres.
In this decade, there are 20 weeks out of the year where a Horror movie hasn't been released!
In second place is Adventure with 15 missing weeks, then Thriller/Suspense with 13.
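The missing-week counts above were tallied as a side effect of the plotting loops. If we only wanted the counts, a set difference would get us there directly. This is a minimal sketch on invented data. The helper `weeks_without_releases` and the `toy` frame are hypothetical and only the column names match our dataset.

```python
import pandas as pd

# Hypothetical toy data: invented rows, column names borrowed from the notebook
toy = pd.DataFrame({
    'release_week': [1, 2, 2, 10, 53],
    'release_year': [1985, 2012, 2015, 2011, 1999],
    'Horror': [True, False, True, False, True],
    'Comedy': [False, True, False, True, True],
})

def weeks_without_releases(frame, genre, starting_year=None):
    """Count the release weeks (1-53) in which no movie of `genre` was released."""
    subset = frame[frame[genre]]
    if starting_year is not None:
        subset = subset[subset['release_year'] >= starting_year]
    used_weeks = set(subset['release_week'])
    return len(set(range(1, 54)) - used_weeks)

toy_genres = ['Horror', 'Comedy']
summary = pd.DataFrame({
    'all_time': {g: weeks_without_releases(toy, g) for g in toy_genres},
    'this_decade': {g: weeks_without_releases(toy, g, starting_year=2010) for g in toy_genres},
})

print(summary)
```

The resulting frame has the same shape as `release_week` above (genres as the index, one column per period), so it could feed the same bar chart.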
It seems like movies have a higher chance of breaking even this decade than they did historically. This may be due to fewer movies being released overall, the growing contribution of the worldwide box office to a movie's finances, or studios being smarter about movie selection.
plot_by_time_and_stat(data=data, genres=genres, title='Breakeven Percentage By Decade',
groupby_column='release_decade', stat_columns=breakeven_columns,
aggregate_function='mean', apply_needed=True, apply_function=lambda x: x * 100,
y_label='Percentage', y_ticks_needed=False, y_ticks='', legend_needed=True,
legend_text=genres, color=colors, axhline_needed=True, axhline_value=50, autolabel_needed=True, autolabel_fontsize=14)
Current Decade
Our bosses want us to wrap this up at some point, so here we go...
Dear, sweet bosses,
Can you give us data on ancillary revenue streams of these movies like DVD/Blu-ray, TV Airings, and Streaming?
You can't?
You can't, or you won't?
Aha, I knew it!
So please take these conclusions with a large grain of salt. So large it might be called a heap of salt. Or a salt mountain, if you please.
Safest
Highest potential return per movie
Most calendar-friendly
Recent box office trends
Then Horror, Horror, Horror.
There's a pretty good reason Blumhouse is doing so well. It makes high-quality movies that are inexpensive to produce. It's basically impossible to do that with Action or Adventure movies, but it can be done with Horror. Other studios could mimic Blumhouse's business model with the least expensive genres.
Median Budgets This Decade
How many movies could we make for the same price as a typical Action or Adventure movie (not including marketing costs)?
Number of movies per one Action movie
Number of movies per one Adventure movie
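The arithmetic behind these comparisons is just a ratio of median budgets: divide the median Action (or Adventure) budget by the median budget of each other genre. Here is a minimal sketch. The budget figures below are placeholders made up for illustration, NOT the medians from our dataset, and `movies_per_one` is a hypothetical helper.

```python
# Hypothetical median budgets in millions of dollars.
# These are invented placeholders, not the notebook's actual medians.
median_budgets = {'Action': 100.0, 'Adventure': 120.0, 'Horror': 20.0, 'Comedy': 40.0}

def movies_per_one(reference_genre, budgets):
    """How many median-budget movies of each other genre cost the same as one reference movie."""
    reference = budgets[reference_genre]
    return {
        genre: round(reference / budget, 1)
        for genre, budget in budgets.items()
        if genre != reference_genre
    }

per_action = movies_per_one('Action', median_budgets)
per_adventure = movies_per_one('Adventure', median_budgets)

print(per_action)
print(per_adventure)
```

With the real data, the dictionary would be replaced by the per-genre medians of `budget_adj` for movies released since 2010. Like the question above, the ratio ignores marketing costs.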
Then what's the problem? Why can't we make these low- to mid-budget movies at a fraction of the cost and make money on them?
The Streaming Problem
Median Grosses This Decade
Currently, Drama, Comedy, and Thriller/Suspense might be too expensive to release theatrically but cheap enough for streaming. Even though they have much smaller budgets, the marketing dollars needed to give a movie a wide release are substantial if you aren't great at viral marketing campaigns. Blumhouse is particularly good at getting the most for its marketing dollar.
Studios may be shifting a lot of low budget fare to streaming platforms, where they get predetermined fees for their content and save big on marketing dollars.
The writing seems to be on the wall. Action and Adventure are the only other genres that are doing well this decade. They tend to travel well, which means the explosion in the foreign box office market bodes well for them.
They are the most expensive genres to produce and market, but they are the big winners in terms of box office dollars.
Our dataset only includes revenue that movies generate from ticket sales, but that is only a slice of the movie revenue pie.
Stephen Follows has a great article detailing the revenue stream of movies nowadays. The following image comes from his article.

To summarize, the release windows are:
Many of these later release windows gain higher license fees if a movie is successful at the box office, making the theatrical window very important. On the other hand, theatrical isn't the only moneymaker, and movies can make up for lackluster box office with future revenue streams.
Here are some next steps to spruce up our analysis:
We have only scratched the surface in our analysis here, but our results provide very actionable insight. Genres that travel well (Action and Adventure) are earning the most these days, and Horror, due to its low cost and consistently good box office results, is a great genre to invest in.